Abstract: Pattern recognition systems based on compressed 
patterns and compressed sensor measurements can be designed 
using low-density matrices. We examine truncation encoding, 
where a subset of the patterns and measurements is stored 
perfectly while the rest is discarded. We also examine the use of 
LDPC parity check matrices for compressing measurements and 
patterns. We show how more general ensembles of good linear 
codes can be used as the basis for pattern recognition system 
design, yielding system design strategies for more general noise 
models. 

I. INTRODUCTION 

A recognition system has to be able to survive in a noisy 
environment subject to its own resource constraints. In most 
cases, including animals and machines, memory sizes are 
finite and sensory systems are only capable of extracting 
a fraction of information about an existing object. Also, a 
network decision system may consist of a sensing agent at 
one location, a database at a second location, and an action 
agent at a third location. The action agent needs data from the 
sensor and the database for recognition and subsequent actions. 
With bandwidth limitations on communication channels, the 
action agent must perform recognition based on compressed, 
maybe lossy, data from the sensing agent and the database. 
Westover [1] and Westover and O'Sullivan [2] derive inner 
and outer bounds of the achievable rate region of recognition 
systems using information theoretic arguments. While these 
results deepen our fundamental understanding of recognition 
systems, they do not provide a practical recognition 
system design. In [3], a recognition system design using 
low-density parity-check (LDPC) matrices is proposed for 
independent and identically distributed (i.i.d.) binary patterns 
under Bernoulli noise. Yet, in general, there are few guidelines 
for designing recognition systems under various noise and 
pattern assumptions. More general coding theory results are 
needed. 

In this paper, we establish coding theory type results for 
recognition system design for discrete patterns. We show 
that a good linear code always leads to a good recognition 
system design. The benefits of using linear codes are that 
the encoding complexity is low; there are many results on 
linear codes for various types of noise distributions; and many 
linear codes have low complexity decoding algorithms, which 
allow one to design fast recognition algorithms. The connections 
established in this paper allow one to bring successful 
results from linear code design to recognition system design. 
Under some conditions, we show that a linear encoding can 
outperform the inner bound of the achievable rate region obtained 
by Westover [1]; see Westover and O'Sullivan [2] for a more 
detailed analysis of achievable rates. 

II. Problem Definitions 

Three aspects of the recognition problem we consider in this 
paper are the environment under which recognition takes place, 
the recognition system itself, and measures of performance. 
These follow the problem setting in [2]. 

The environment consists of six elements, denoted as 

$$\mathcal{E} = (M_c, P_j, P_x, \mathcal{X}, P_{y|x}, \mathcal{Y}). \tag{1}$$

$M_c = 2^{nR_c}$ is the total number of objects to be recognized; 
$R_c$ is the pattern rate. Each pattern is a length-$n$ sequence 
with each element taking values in the set $\mathcal{X}$. Here, we 
consider discrete patterns in which each element takes 
values in $GF(r)$. Each pattern is drawn independently from 
a distribution $P_x$ and is denoted $x_i$, $i \in \{1, 2, \ldots, M_c\}$. The 
set of all $M_c$ patterns to be recognized is denoted $\mathcal{C}$. In 
the training phase, we assume that a recognition system can 
observe $x_i$. In the testing phase, an object index $j$ is drawn 
from $\{1, 2, \ldots, M_c\}$ according to an index distribution $P_j$. The 
corresponding object sequence $x_j$ is then presented to the 
recognition system with noise whose transition probability is 
$P_{y|x}$, where each element of $y$ takes values in the set $\mathcal{Y}$. 
Here we assume that $P_j$ is the uniform distribution. Also, the 
noise, denoted $z$, is assumed to be additive and is modeled as 
a length-$n$ sequence over $GF(r)$ drawn from a distribution $P_z$, 
independent of $x_i$ for all $i$ and of any design of the recognition system. 
Hence 

$$P_{y|x}(y \mid x) = P_z(y - x), \tag{2}$$

and the recognition system observes data 

$$y = x_j + z, \tag{3}$$

where the addition is over $GF(r)$. 

A recognition system consists of a sensory compression 
function $g$, a memory compression function $f$, and a recognition 
algorithm $\phi$. The sensory compression function $g$ maps 
an observed $y \in GF(r)^n$ to compressed sensory data 
$\sigma \in GF(r)^{nR_s}$, where $R_s$ is defined to be the sensory 
compression rate. Similarly, the memory compression function $f$ maps 
each object sequence $x_i \in GF(r)^n$ to compressed memory 
data $s_i \in GF(r)^{nR_m}$, where $R_m$ is the memory compression 
rate. In the linear encoding case, sensory compression and 
memory compression are performed using matrices $G$ of size 
$nR_s$ by $n$ and $H$ of size $nR_m$ by $n$ over $GF(r)$, such that 



$$\sigma = Gy \tag{4}$$

is the compressed sensory data and 

$$s_i = Hx_i \tag{5}$$



is the compressed memory data of the object with index $i$. 
The set of all memory data $s_i$, $i \in \{1, 2, \ldots, M_c\}$, is denoted 
$\mathcal{S}$. We are interested in designing good recognition systems 
given $(R_c, R_m, R_s, P_x, P_z)$. 
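As an illustration of the environment and of the linear compression maps (4) and (5), the following minimal sketch works over $GF(2)$ with Bernoulli $\frac{1}{2}$ patterns and Bernoulli $q$ noise. The function names, the small dimensions, and the randomly drawn matrices `H` and `G` are our assumptions for illustration only; they are not a design proposed in this paper.

```python
import random

def mat_vec_gf2(M, v):
    """Multiply matrix M (a list of rows) by vector v over GF(2)."""
    return [sum(m * x for m, x in zip(row, v)) % 2 for row in M]

def sample_environment(n, num_patterns, q, rng):
    """Draw i.i.d. Bernoulli(1/2) patterns, a uniform index j, and Bernoulli(q) noise."""
    patterns = [[rng.randint(0, 1) for _ in range(n)] for _ in range(num_patterns)]
    j = rng.randrange(num_patterns)
    z = [1 if rng.random() < q else 0 for _ in range(n)]
    y = [(a + b) % 2 for a, b in zip(patterns[j], z)]  # y = x_j + z over GF(2)
    return patterns, j, z, y

rng = random.Random(0)
n = 8
patterns, j, z, y = sample_environment(n, 4, 0.1, rng)

# Hypothetical compression matrices: any matrices of matching shape would do here.
H = [[rng.randint(0, 1) for _ in range(n)] for _ in range(3)]  # n * R_m = 3 rows
G = [[rng.randint(0, 1) for _ in range(n)] for _ in range(5)]  # n * R_s = 5 rows

memory = [mat_vec_gf2(H, x) for x in patterns]  # s_i = H x_i, equation (5)
sigma = mat_vec_gf2(G, y)                       # sigma = G y, equation (4)
print(memory, sigma)
```

Because the maps are linear, the compressed observation splits as $Hy = Hx_j + Hz$, which is the property the later sections rely on.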

The recognition algorithm $\phi$ takes $\mathcal{S}$ and $\sigma$ as inputs and 
computes an estimate $\hat{j}$ of the true object index. It consists of a 
noise estimation algorithm and an index estimation algorithm. 
The noise estimation algorithm is denoted as 

$$d(s_i, \sigma) : GF(r)^{nR_m} \times GF(r)^{nR_s} \to GF(r)^n \cup \{e\}. \tag{6}$$

For each object index $i$, it computes an estimated noise 
under the hypothesis that the $i$th object was selected in the testing 
phase. The estimated noise of the $i$th object is denoted as 

$$\hat{z}_i = d(s_i, \sigma). \tag{7}$$



If the algorithm fails for the $i$th index, subject to some criteria 
of failure depending on the system design, $d(\cdot, \cdot)$ outputs 
an error $e$. After the recognition system completes noise 
estimation for all indexes, it proceeds to index estimation. 
Since the index $j$ is chosen uniformly in the testing phase, 
the index estimation algorithm simply 
selects the index estimate $\hat{j}$ to be the index associated with the 
largest $P_z(\hat{z}_i)$, where we define $P_z(e) = 0$. This means that the 
recognition system rejects indexes with a noise estimation error. 
From now on in this paper, $j$ always denotes the true object 
index selected in the test phase, and $i \in \{1, \ldots, M_c\} \setminus \{j\}$. 
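The two-stage structure of the recognition algorithm can be sketched as follows. This is a minimal illustration, not the paper's implementation: `recognize`, `toy_estimate_noise`, and `bernoulli_prob` are hypothetical names, the toy noise estimator stores patterns uncompressed over $GF(2)$, and its rejection rule (weight above 2) is an arbitrary stand-in for the failure criterion of $d(\cdot,\cdot)$.

```python
def recognize(memory, sigma, estimate_noise, noise_prob):
    """Pick the index with the largest P_z(z_hat_i); rejected indexes count as P_z(e) = 0."""
    best_index, best_prob = None, 0.0
    for i, s in enumerate(memory):
        z_hat = estimate_noise(s, sigma)  # d(s_i, sigma); None plays the role of e
        prob = 0.0 if z_hat is None else noise_prob(z_hat)
        if prob > best_prob:
            best_index, best_prob = i, prob
    return best_index  # None if every index was rejected

def toy_estimate_noise(s, sigma, max_weight=2):
    """Toy d(.,.): difference over GF(2), rejected when its weight is too large."""
    z_hat = tuple((a - b) % 2 for a, b in zip(sigma, s))
    return z_hat if sum(z_hat) <= max_weight else None

def bernoulli_prob(z_hat, q=0.1):
    """P_z for i.i.d. Bernoulli(q) noise."""
    w = sum(z_hat)
    return q ** w * (1 - q) ** (len(z_hat) - w)

memory = [(0, 0, 0, 0, 0, 0), (1, 1, 1, 0, 0, 0), (0, 0, 0, 1, 1, 1)]
sigma = (1, 1, 1, 0, 0, 1)  # pattern 1 with a single flipped symbol
print(recognize(memory, sigma, toy_estimate_noise, bernoulli_prob))
```

Any noise estimator with the same interface, such as a syndrome decoder, can be substituted for `toy_estimate_noise`.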

A recognition system makes an error if $\hat{j} \neq j$. The average 
probability of error of an ensemble of recognition system 
designs is defined to be 

$$P_e = \sum_{f,g,\mathcal{C},z} P(\hat{j} \neq j \mid \mathcal{C}, z, f, g)\, P_c(\mathcal{C})\, P_z(z)\, P_{f,g}(f,g), \tag{8}$$

which averages over all realizations of $\mathcal{C}$, $z$, and the 
recognition system. Note that $P_{f,g}(f,g)$ is specified when the 
ensemble of recognition system designs is defined, and 

$$P_c(\mathcal{C}) = \prod_{x \in \mathcal{C}} P_x(x). \tag{9}$$

The probability of error depends on the pattern length $n$. A 
rate triple $(R_c, R_m, R_s)$ is said to be achievable in an 
environment $\mathcal{E}$ if there exists a recognition system such that 
$P_e$ goes to zero as $n$ goes to infinity. 



III. Truncation Encoding for i.i.d. Patterns and i.i.d. Noise 

In this section, we show that a truncation encoding outperforms 
the inner bounds on the achievable rate region for Bernoulli 
$\frac{1}{2}$ patterns under Bernoulli noise obtained in [1] and [2]. This 
truncation encoding works for all $GF(r)$, $r \geq 2$. It is assumed 
that the elements of a pattern sequence are independent and 
identically distributed (i.i.d.), drawn from a distribution $Q_x$ 
on $GF(r)$. Similarly, the elements of the noise sequence 
are i.i.d., drawn from $Q_z$ on $GF(r)$. Let $H = [I_{nR_m}\ 0]$ and 
$G = [I_{nR_s}\ 0]$, where $I_{nR_m}$ and $I_{nR_s}$ are identity matrices of 
size $nR_m$ and $nR_s$ respectively. Thus $s_i$ consists of the first $nR_m$ 
elements of $x_i$, and $\sigma$ consists of the first $nR_s$ elements of $y = x_j + z$. 
Let $n_{\min} = \min(nR_m, nR_s)$. For any length-$n$ sequence $a$, 
$a_{n_{\min}}$ denotes the sequence of the first $n_{\min}$ elements of $a$, 
and $a_{n_{\min}^c}$ denotes the rest of $a$. By definition, we know that 

$$s_{i,n_{\min}} = x_{i,n_{\min}} \quad \text{and} \quad \sigma_{n_{\min}} = y_{n_{\min}}.$$

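The noise estimation algorithm works as follows. For each 
pair $(s_i, \sigma)$, the algorithm checks whether $(s_{i,n_{\min}}, \sigma_{n_{\min}})$ is in 
the jointly typical set $T_\epsilon^{n_{\min}}$, where the jointly typical set is 
defined as 

$$
T_\epsilon^{n_{\min}} = \Big\{ (x,y) \in GF(r)^{n_{\min}} \times GF(r)^{n_{\min}} :
\Big| -\tfrac{1}{n_{\min}} \log P(x) - H(Q_x) \Big| < \epsilon,\ 
\Big| -\tfrac{1}{n_{\min}} \log P(y) - H(Q_x * Q_z) \Big| < \epsilon,\ 
\Big| -\tfrac{1}{n_{\min}} \log P(x,y) - H(Q_x) - H(Q_z) \Big| < \epsilon \Big\}, \tag{10}
$$

where $Q_x * Q_z$ denotes the output distribution of a noisy 
channel with input distribution $Q_x$ and additive noise distribution 
$Q_z$. It proceeds if $(s_{i,n_{\min}}, \sigma_{n_{\min}}) \in T_\epsilon^{n_{\min}}$; otherwise 
it outputs an $e$ indicating an error. The algorithm computes 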

$$\hat{z}_{i,n_{\min}} = y_{n_{\min}} - x_{i,n_{\min}} = -x_{i,n_{\min}} + x_{j,n_{\min}} + z_{n_{\min}}, \tag{11}$$

and then concatenates it with $n - n_{\min}$ zeros to get the 
estimated noise $\hat{z}_i$. Finally, the system selects the index 

$$\hat{j} = \arg\max_{k \in \{1,2,\ldots,M_c\}} P_z(\hat{z}_k) = \arg\max_{k \in \{1,2,\ldots,M_c\}} P_z(\hat{z}_{k,n_{\min}}) \tag{12}$$

as its estimated index. 
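The truncation procedure above can be sketched for the binary case as follows. This is an illustrative sketch under stated assumptions: patterns are Bernoulli $\frac{1}{2}$ and noise is Bernoulli $q$, so the two marginal conditions in (10) hold automatically and the joint condition reduces to a typicality check on the estimated noise. The function name `truncation_recognize` and all parameter values are ours.

```python
import math
import random

def truncation_recognize(memory, sigma, q, eps):
    """Truncation-encoding recognition over GF(2) with Bernoulli(q) noise."""
    n_min = min(len(memory[0]), len(sigma))
    h_q = -q * math.log2(q) - (1 - q) * math.log2(1 - q)
    best_index, best_logp = None, -math.inf
    for i, s in enumerate(memory):
        # Equation (11): estimated noise on the first n_min coordinates.
        z_hat = [(a - b) % 2 for a, b in zip(sigma[:n_min], s[:n_min])]
        weight = sum(z_hat)
        logp = weight * math.log2(q) + (n_min - weight) * math.log2(1 - q)
        # Typicality check; a failure corresponds to the error output e.
        if abs(-logp / n_min - h_q) >= eps:
            continue
        # Equation (12): keep the index with the most probable noise estimate.
        if logp > best_logp:
            best_index, best_logp = i, logp
    return best_index

rng = random.Random(0)
n, r_m, r_s, q = 400, 0.6, 0.5, 0.1
patterns = [[rng.randint(0, 1) for _ in range(n)] for _ in range(16)]
j = rng.randrange(len(patterns))
z = [1 if rng.random() < q else 0 for _ in range(n)]
y = [(a + b) % 2 for a, b in zip(patterns[j], z)]
memory = [x[:round(n * r_m)] for x in patterns]  # s_i: first n*R_m symbols of x_i
sigma = y[:round(n * r_s)]                       # sigma: first n*R_s symbols of y
print(truncation_recognize(memory, sigma, q, eps=0.2), j)
```

The padded zeros of $\hat{z}_i$ are omitted because they contribute the same factor to every candidate and do not change the maximizer in (12).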

Theorem 1: The probability that $\hat{j} \neq j$ goes to zero as $n$ goes 
to infinity if 

$$R_c < \min(R_m, R_s)\,\big(H(Q_x * Q_z) - H(Q_z) - 3\epsilon\big). \tag{13}$$

Proof: There are two situations under which the truncation 
encoding recognition system makes an error. The first 
situation is when $(s_{j,n_{\min}}, \sigma_{n_{\min}})$ is not in $T_\epsilon^{n_{\min}}$. The second 
situation is when $(s_{j,n_{\min}}, \sigma_{n_{\min}}) \in T_\epsilon^{n_{\min}}$ but there exists 
at least one other object index $i$ such that $(s_{i,n_{\min}}, \sigma_{n_{\min}}) \in T_\epsilon^{n_{\min}}$ 
and $P(\hat{z}_i) > P(\hat{z}_j)$. The probability of the first situation 
goes to $\epsilon$ as $n$ grows large, and $\epsilon$ can be chosen to be arbitrarily 
small because of the standard property of the jointly typical set. 
The probability of the second situation can be bounded by the 
probability that there exists at least one other object $i$ with 
$(s_{i,n_{\min}}, \sigma_{n_{\min}}) \in T_\epsilon^{n_{\min}}$. Hence the probability of the second 
situation is bounded by 

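$$
\begin{aligned}
&\sum_{j \in \{1,2,\ldots,M_c\}} P(j) \sum_{z \in \{0,1\}^n} P(z)\, P\big(\exists i : (x_{i,n_{\min}}, y_{n_{\min}}) \in T_\epsilon^{n_{\min}} \,\big|\, z, j\big) && (14) \\
&\overset{(a)}{=} \sum_{z \in \{0,1\}^n} P(z)\, P\big(\exists i : (x_{i,n_{\min}}, y_{n_{\min}}) \in T_\epsilon^{n_{\min}} \,\big|\, z\big) && (15) \\
&= \sum_{z \in \{0,1\}^n} P(z_{n_{\min}^c})\, P(z_{n_{\min}})\, P\big(\exists i : (x_{i,n_{\min}}, y_{n_{\min}}) \in T_\epsilon^{n_{\min}} \,\big|\, z_{n_{\min}}\big) && (16) \\
&= \sum_{z_{n_{\min}^c} \in \{0,1\}^{n - n_{\min}}} P(z_{n_{\min}^c}) \sum_{z_{n_{\min}} \in \{0,1\}^{n_{\min}}} P(z_{n_{\min}})\, P\big(\exists i : (x_{i,n_{\min}}, y_{n_{\min}}) \in T_\epsilon^{n_{\min}} \,\big|\, z_{n_{\min}}\big) && (17) \\
&= \sum_{z_{n_{\min}^c} \in \{0,1\}^{n - n_{\min}}} P(z_{n_{\min}^c})\, P\big(\exists i : (x_{i,n_{\min}}, y_{n_{\min}}) \in T_\epsilon^{n_{\min}}\big) && (18) \\
&\overset{(b)}{\leq} \sum_{z_{n_{\min}^c} \in \{0,1\}^{n - n_{\min}}} P(z_{n_{\min}^c}) \Bigg( \sum_{i \in \{2,3,\ldots,M_c\}} P\big((x_{i,n_{\min}}, y_{n_{\min}}) \in T_\epsilon^{n_{\min}}\big) \Bigg) && (19) \\
&\overset{(c)}{=} (M_c - 1)\, P\big((x_{i,n_{\min}}, y_{n_{\min}}) \in T_\epsilon^{n_{\min}}\big) && (20) \\
&\overset{(d)}{\leq} (1 + \epsilon)\, 2^{nR_c}\, 2^{-n_{\min}(I(X;Y) - 3\epsilon)} && (21) \\
&\overset{(e)}{\leq} (1 + \epsilon)\, 2^{-n\left(\min(R_m, R_s)(H(Q_x * Q_z) - H(Q_z)) - R_c - 3\epsilon\right)} && (22)
\end{aligned}
$$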

where 

(a) follows because $j$ is uniformly distributed and all $x_i$ are 
independently drawn from the same distribution; 

(b) follows from taking the union bound; 

(c) follows because the terms inside the parentheses of (19) 
are independent of $z_{n_{\min}^c}$; 

(d) follows from the property of the jointly typical set: 
if $x_{i,n_{\min}}$ and $y_{n_{\min}}$ are independent 
with the same marginals as $P(x_{j,n_{\min}}, y_{n_{\min}})$, 
then the probability that $(x_{i,n_{\min}}, y_{n_{\min}}) \in T_\epsilon^{n_{\min}}$ is at most 
$2^{-n_{\min}(I(X;Y) - 3\epsilon)}$, and the elements of $x_{j,n_{\min}}$ and 
$z_{n_{\min}}$ are i.i.d., hence so are the elements of $y_{n_{\min}}$; 

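(e) follows from $n_{\min} = n \min(R_m, R_s)$, $\min(R_m, R_s) \leq 1$, and 

$$
\begin{aligned}
I(X;Y) &= H(X + Z) - H(X + Z \mid X) && (23) \\
&= H(Q_x * Q_z) - H(Q_z). && (24)
\end{aligned}
$$

Thus if 

$$R_c < \min(R_m, R_s)\big(H(Q_x * Q_z) - H(Q_z)\big) - 3\epsilon, \tag{25}$$

the probability of recognition error goes to zero as $n$ goes to 
infinity. ■ 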
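The right-hand side of (13) is straightforward to evaluate numerically. The sketch below is an illustration under one explicit assumption: addition over $GF(r)$ is implemented as addition modulo $r$, which is valid only when $r$ is prime. The function names are ours.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(v * math.log2(v) for v in p if v > 0)

def additive_output(qx, qz):
    """Distribution Q_x * Q_z of X + Z (mod r), assuming r = len(qx) is prime."""
    r = len(qx)
    out = [0.0] * r
    for a, pa in enumerate(qx):
        for b, pb in enumerate(qz):
            out[(a + b) % r] += pa * pb
    return out

def truncation_rate_bound(qx, qz, r_m, r_s, eps=0.0):
    """Right-hand side of (13): min(R_m, R_s) * (H(Q_x * Q_z) - H(Q_z) - 3 eps)."""
    info = entropy(additive_output(qx, qz)) - entropy(qz)  # equals I(X;Y) by (23)-(24)
    return min(r_m, r_s) * (info - 3 * eps)

# Binary example: Bernoulli(1/2) patterns, Bernoulli(0.1) noise, R_m = R_s = 0.5.
print(truncation_rate_bound([0.5, 0.5], [0.9, 0.1], 0.5, 0.5))
# Ternary example with uniform patterns over GF(3).
print(truncation_rate_bound([1 / 3] * 3, [0.8, 0.1, 0.1], 0.7, 0.5))
```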



Corollary: In particular, if the elements of $x_i$ are drawn i.i.d. 
Bernoulli $\frac{1}{2}$ and the noise is i.i.d. Bernoulli $q$, the 
pattern rate $R_c$ is achievable if 

$$R_c < \min(R_m, R_s)(1 - H(q)) - 3\epsilon, \tag{26}$$

where $0 < \min(R_m, R_s) < 1$ and $H(q) < 1$. For an i.i.d. 
Bernoulli $\frac{1}{2}$ source and any i.i.d. Bernoulli $q$ noise, this 
truncation encoding performs better than the ensemble of 
recognition system designs based on LDPC matrices proposed 
by O'Sullivan and Lai [3], which requires 

$$R_c < \min(R_m, R_s) - H(q) - \epsilon. \tag{27}$$

Also notice that for $R_m = R_s = R$, the bound (26) on 
$R_c$ is above the inner bound from [1] and is very close to 
the theoretical outer bound computed by Westover [1] and 
Westover and O'Sullivan [2]. They have shown an outer bound 
that is a concave function of $R$ and is very close to the 
straight line $R(1 - H(q))$. 
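To make the comparison between (26) and (27) concrete, the following short sketch evaluates both bounds with $\epsilon$ set to zero. The function names and the grid of rates and noise levels are our choices for illustration.

```python
import math

def binary_entropy(q):
    """Binary entropy H(q) in bits."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def truncation_bound(rate, q):
    """Bound (26) with epsilon = 0, where rate = min(R_m, R_s)."""
    return rate * (1 - binary_entropy(q))

def ldpc_bound(rate, q):
    """Bound (27) with epsilon = 0, where rate = min(R_m, R_s)."""
    return rate - binary_entropy(q)

for rate in (0.25, 0.5, 0.75):
    for q in (0.01, 0.05, 0.11):
        print(rate, q, round(truncation_bound(rate, q), 4), round(ldpc_bound(rate, q), 4))
```

The gap between the two bounds is $(1 - \min(R_m, R_s))H(q)$, which is nonnegative and grows as the compression becomes more aggressive.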

Here we discuss another interesting example where the 
noise distribution $Q_z$ is only partially known. We assume that the 
elements of $x_i$ are i.i.d., drawn from the uniform distribution 
over $GF(r)$. We assume that the elements of $z$ are i.i.d., drawn 
from a distribution $Q_z$, but only $Q_z(0) = 1 - q$ is known (each 
element of $z$ takes the value 0 with probability $1 - q$). We want to 
find the least upper bound on $R_c$ among all such distributions 
given $R = \min(R_m, R_s)$ using truncation encoding. This is a 
constrained optimization problem 

$$\max H(Q_z) \quad \text{subject to} \quad \sum_{k \in GF(r)} q_k = 1, \quad q_k \geq 0\ \ \forall k, \tag{28}$$

where $q_k = Q_z(k)$. The maximum can easily be shown to be 
achieved for $q_k = \frac{q}{r-1}$ for all $k \neq 0$. The least upper bound on $R_c$ 
is then 

$$R \left[ \log r + (1 - q) \log(1 - q) + q \log \frac{q}{r - 1} \right], \tag{29}$$

where all logarithms are taken base 2. 
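The worst-case expression (29) can be checked numerically. The sketch below is illustrative only: it evaluates (29) and compares it with $R(\log r - H(Q_z))$ for randomly drawn noise distributions that share the same $Q_z(0) = 1 - q$. The helper names and the sampling scheme are our assumptions.

```python
import math
import random

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(v * math.log2(v) for v in p if v > 0)

def worst_case_bound(rate, r, q):
    """Expression (29) for 0 < q < 1 and r >= 2, with rate = min(R_m, R_s)."""
    return rate * (math.log2(r) + (1 - q) * math.log2(1 - q) + q * math.log2(q / (r - 1)))

def random_noise_distribution(r, q, rng):
    """A random Q_z on r symbols with Q_z(0) = 1 - q (hypothetical sampling scheme)."""
    weights = [rng.random() for _ in range(r - 1)]
    total = sum(weights)
    return [1 - q] + [q * w / total for w in weights]

rng = random.Random(0)
r, q, rate = 5, 0.2, 0.5
print("worst case:", worst_case_bound(rate, r, q))
for _ in range(3):
    qz = random_noise_distribution(r, q, rng)
    # With uniform patterns, H(Q_x * Q_z) = log2(r), so the bound is rate * (log2 r - H(Q_z)).
    print(rate * (math.log2(r) - entropy(qz)))
```

For $r = 2$ the expression reduces to $R(1 - H(q))$, in agreement with (26) when $\epsilon$ is neglected.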

Note that there are noticeable differences between recognition 
and lossless source coding with side information. The 
bits useful in recognition systems are different from the bits useful 
for lossless source coding. Also, even if a joint lossless source 
code is available, it might not be good for recognition. Given 
two correlated sequences $x$ and $y$, the achievable rate region 
of lossless source codes with side information obtained by 
Ahlswede and Körner [5] is 

$$
\begin{aligned}
R_x &\geq H(X \mid V), && (30) \\
R_y &\geq I(Y; V), && (31)
\end{aligned}
$$

where $V$ is an auxiliary random variable and $X - Y - V$ is a 
Markov chain. For $x$ being Bernoulli $\frac{1}{2}$ and $y = x + z$, where 
$z$ is Bernoulli $q$, $R_y = 1$ and $R_x = H(q)$ is an achievable 
rate pair to reconstruct $x$ and hence reconstruct $z$. However, 
Theorem 1 shows that it is not always necessary to reconstruct 
the entire $x$ or $z$ for recognition. Also, Theorem 1, [1], and [2] 
all show that even if lossless coding is possible for a given 
recognition system with $R_m = R_x$, $R_s = R_y$, it is not good 
for recognition if the compression rates are below the required 
bounds. A large sensory compression rate $R_s = R_y = 1$ alone 
does not yield good performance because, even if it is sufficient 
to reconstruct the true noise $z$, it is not sufficient to suppress 
the probability that there exists another pattern which is jointly 
typical with a sequence matching the compressed memory and 
sensory data. From a linear coding point of view, with $G$ used for 
encoding $x$, the above argument means that the cardinality of 
each coset of $G$ is too large to guarantee that, for all the $2^{nR_c} - 1$ 
false objects, the coset $G(x_i + y)$ does not contain a sequence 
which is jointly typical with $x_i$. 

IV. Linear Encoding for Arbitrary Independent Noise 

Although the truncation encoding works well for i.i.d. 
Bernoulli patterns under i.i.d. Bernoulli noise, we 
shall see that there exist many cases where the LDPC encoding 
proposed in [3], as well as several other linear codes or 
ensembles of linear codes, work reasonably well while no 
simple truncation encoding does. To see this, let us assume 
that the elements of patterns are i.i.d., drawn from the uniform 
distribution over $GF(r)$, denoted as $Q_x$. The additive noise 
sequence is drawn from a distribution whose mean entropy 
is $nR_z$ for some $0 < R_z < 1$. Under this loose constraint, 
which allows nonstationary noise distributions, it might not 
be sufficient to have good statistical properties for recognition 
by simply computing the first $n_{\min}$ elements of the noise 
sequence. Notice that when an LDPC matrix is used for 
compression, the codes used are viewed as LDGM codes, 
which are also known to have good performance for source 
coding and channel coding [6] [7]. 

Under the pattern and noise assumptions stated above, if the 
LDPC recognition system design proposed by O'Sullivan and 
Lai [3] is used, the following Theorem 2 can be proved. 

By a good ensemble for generating LDPC matrices, we mean 
that the ensemble and noise average block decoding error goes 
to zero as $n$ goes to infinity. By a good recognition system design 
we mean that the ensemble and noise average recognition error 
goes to zero as $n$ gets large. 

Theorem 2: If there exists a good ensemble for generating 
LDPC matrices of rate $R = \min(R_m, R_s)$, along with a 
syndrome decoding algorithm under a noise distribution with 
entropy $nR_z$, then there exists a good recognition system 
design using the same LDPC matrix ensemble and syndrome 
decoding algorithm for all $R_c < \min(R_m, R_s) - R_z$. 

The proof is omitted since it follows directly from the 
following Theorem 3. 

Theorem 3: If there exists a good ensemble of linear 
codes of rate $R = \min(R_m, R_s)$ and a decoding algorithm 
for a noise distribution with entropy $nR_z$, then for all 
$R_c < \min(R_m, R_s) - R_z$, there exists a good pattern 
recognition system design using the generator matrix of 
the linear block code, and the decoding algorithm as the noise 
estimation algorithm under the same noise distribution. 



Proof: Without loss of generality, let us assume that 
$R_m \leq R_s$. Memory compression is done using $H$, denoting 
a parity check matrix generated by the linear code ensemble, 
such that $s_i = Hx_i$. Sensory compression is done by the matrix 
$G = [H^T\ 0]^T$. Let $d(\cdot, \cdot)$ denote the syndrome decoding 
associated with the linear code ensemble, with a typical set 
check. The typical set check is done by verifying whether $\hat{z}_i$ is in 
$T_\epsilon^n$, where 

$$T_\epsilon^n = \Big\{ z : \Big| -\tfrac{1}{n} \log P(z) - R_z \Big| < \epsilon \Big\}. \tag{32}$$

Because the probability that $z \notin T_\epsilon^n$ is $\epsilon$, which can be chosen 
to be arbitrarily small, and the decoding algorithm for inferring 
$\hat{z}_j$ is good, we focus on the probability of index estimation 
error, similar to the proof of Theorem 1. The probability of 
index estimation error is less than 



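$$
\begin{aligned}
&\sum_{j} P(j) \sum_{z} P(z)\, P\big(\exists i : \hat{z}_i \in T_\epsilon^n \,\big|\, z, j\big) && (33) \\
&= \sum_{z} P(z)\, P\big(\exists i : \hat{z}_i \in T_\epsilon^n \,\big|\, z\big) && (34) \\
&= \sum_{z} P(z)\, P\big(\exists i : d(s_i, \sigma) \in T_\epsilon^n \,\big|\, z\big) && (35) \\
&\overset{(a)}{=} \sum_{z} P(z)\, P\big(\exists i : d(0, H(x_i - x_1 + z)) \in T_\epsilon^n \,\big|\, z\big) && (36) \\
&\overset{(b)}{=} \sum_{z} P(z)\, P\big(\exists i : d(0, H\tilde{x}) \in T_\epsilon^n\big) && (37) \\
&\overset{(c)}{\leq} 2^{nR_c} \sum_{z} P(z)\, P\big(d(0, H\tilde{x}) \in T_\epsilon^n\big) && (38) \\
&\overset{(d)}{=} 2^{nR_c}\, P\big(d(0, H\tilde{x}) \in T_\epsilon^n\big) && (39) \\
&\leq 2^{nR_c} \sum_{\hat{z} \in T_\epsilon^n} P\big(H\tilde{x} = H\hat{z} \,\big|\, \hat{z}\big) && (40) \\
&= 2^{nR_c} \sum_{\hat{z} \in T_\epsilon^n} 2^{-nR_m} && (41) \\
&\overset{(e)}{\leq} 2^{nR_c}\, 2^{n(R_z + \epsilon)}\, 2^{-nR_m}, && (42)
\end{aligned}
$$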



where 

(a) follows from the construction of $G$ based on $H$; 

(b) holds because the elements of $x_i$ and $x_1$ are both i.i.d. from 
the uniform distribution over $GF(r)$, and $x_i$ and $x_1$ are 
independent of each other and independent of $z$, so that the 
elements of $x_i - x_1 + z$ are also i.i.d. and uniformly 
distributed; this sequence is denoted as $\tilde{x}$; 

(c) follows from the union bound, with a total of $2^{nR_c} - 1$ 
terms in the sum; 

(d) follows because $\tilde{x}$ is independent of $z$, see (b); 

(e) follows because the cardinality of $T_\epsilon^n$ is upper bounded by $2^{n(R_z + \epsilon)}$. 

Hence the probability of index estimation error goes to zero as 
$n$ goes to infinity if 

$$R_c < \min(R_m, R_s) - R_z - \epsilon. \tag{43}$$



Note that if the complexity of the decoding algorithm 
is $O(f(n))$, the complexity of the recognition system per 
object is also $O(f(n))$. Hence Theorem 3 not only connects 
good linear code design to good recognition system design, it 
also connects low complexity algorithms for decoding linear 
codes to noise estimation in recognition systems. 
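The construction in the proof of Theorem 3 can be illustrated on a toy scale. The sketch below is not the LDPC design of [3]: as assumptions of ours, it uses the parity check matrix of the binary (7,4) Hamming code with $R_m = R_s$ (so $G = H$), a brute-force coset leader table in place of a low complexity decoder, Bernoulli noise with $q < \frac{1}{2}$ (so the most probable noise estimate is the one of least weight), and no typical set check. At this block length the asymptotic guarantee of the theorem does not apply.

```python
from itertools import product

# Parity check matrix of the (7,4) Hamming code (an illustrative choice).
H_HAMMING = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

def syndrome(H, v):
    """Compute H v over GF(2)."""
    return tuple(sum(h * x for h, x in zip(row, v)) % 2 for row in H)

def build_coset_leaders(H):
    """Map each syndrome to a minimum weight sequence producing it (brute force)."""
    n = len(H[0])
    leaders = {}
    for v in sorted(product((0, 1), repeat=n), key=sum):
        leaders.setdefault(syndrome(H, v), v)
    return leaders

def recognize(memory, sigma, leaders):
    """Noise estimation d(s_i, sigma) = d(0, sigma - s_i), then least weight index."""
    best_index, best_weight = None, None
    for i, s in enumerate(memory):
        diff = tuple((a - b) % 2 for a, b in zip(sigma, s))
        z_hat = leaders[diff]
        if best_weight is None or sum(z_hat) < best_weight:
            best_index, best_weight = i, sum(z_hat)
    return best_index

leaders = build_coset_leaders(H_HAMMING)
patterns = [(0, 0, 0, 0, 0, 0, 0), (1, 0, 0, 0, 0, 0, 0), (0, 1, 0, 0, 0, 0, 0)]
memory = [syndrome(H_HAMMING, x) for x in patterns]  # s_i = H x_i
sigma = syndrome(H_HAMMING, patterns[1])             # noiseless observation of pattern 1
print(recognize(memory, sigma, leaders))
```

The per-object work is one syndrome difference and one decoder call, which mirrors the complexity remark above.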

LDPC codes can be used for non-i.i.d. noise. For example, 
Eckford, Kschischang, and Pasupathy [8] analyzed LDPC 
codes for Gilbert-Elliott channels, which are binary symmetric 
channels with crossover probability depending on Markov 
processes, and Nicola, Alajaji, and Linder [9] developed decoding 
algorithms for LDPC codes with a queue-based channel. Based 
on Theorem 3 and [3], LDPC codes with the algorithms they 
developed can be used for good recognition system design for 
those noise models. 